{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#参数解读https://blog.csdn.net/TeFuirnever/article/details/99818078"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>comment</th>\n",
       "      <th>sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>口味：不知道是我口高了，还是这家真不怎么样。??我感觉口味确实很一般很一般。上菜相当快，我敢...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>菜品丰富质量好，服务也不错！很喜欢！</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>说真的，不晓得有人排队的理由，香精香精香精香精，拜拜！</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>菜量实惠，上菜还算比较快，疙瘩汤喝出了秋日的暖意，烧茄子吃出了大阪烧的味道，想吃土豆片也是口...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>先说我算是娜娜家风荷园开业就一直在这里吃??每次出去回来总想吃一回??有时觉得外面的西式简餐...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             comment  sentiment\n",
       "0  口味：不知道是我口高了，还是这家真不怎么样。??我感觉口味确实很一般很一般。上菜相当快，我敢...          0\n",
       "1                                 菜品丰富质量好，服务也不错！很喜欢！          1\n",
       "2                        说真的，不晓得有人排队的理由，香精香精香精香精，拜拜！          0\n",
       "3  菜量实惠，上菜还算比较快，疙瘩汤喝出了秋日的暖意，烧茄子吃出了大阪烧的味道，想吃土豆片也是口...          1\n",
       "4  先说我算是娜娜家风荷园开业就一直在这里吃??每次出去回来总想吃一回??有时觉得外面的西式简餐...          1"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_excel('data_test_train.xlsx')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### KNN算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据预处理 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#根据任务需要做处理\n",
    "#数据去重、删除空值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### jieba分词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache C:\\Users\\17888\\AppData\\Local\\Temp\\jieba.cache\n",
      "Loading model cost 1.449 seconds.\n",
      "Prefix dict has been built successfully.\n"
     ]
    }
   ],
   "source": [
    "import jieba\n",
    "#根据需要补充完善分词功能，可参照词频分析或者LDA主题模型的代码\n",
    "def chinese_word_cut(mytext):\n",
    "    return \" \".join(jieba.cut(mytext))\n",
    "\n",
    "data['cut_comment'] = data.comment.apply(chinese_word_cut)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>comment</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>cut_comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>口味：不知道是我口高了，还是这家真不怎么样。??我感觉口味确实很一般很一般。上菜相当快，我敢...</td>\n",
       "      <td>0</td>\n",
       "      <td>口味 ： 不 知道 是 我口 高 了 ， 还是 这家 真 不怎么样 。 ? ? 我 感觉 口...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>菜品丰富质量好，服务也不错！很喜欢！</td>\n",
       "      <td>1</td>\n",
       "      <td>菜品 丰富 质量 好 ， 服务 也 不错 ！ 很 喜欢 ！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>说真的，不晓得有人排队的理由，香精香精香精香精，拜拜！</td>\n",
       "      <td>0</td>\n",
       "      <td>说真的 ， 不 晓得 有人 排队 的 理由 ， 香精 香精 香精 香精 ， 拜拜 ！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>菜量实惠，上菜还算比较快，疙瘩汤喝出了秋日的暖意，烧茄子吃出了大阪烧的味道，想吃土豆片也是口...</td>\n",
       "      <td>1</td>\n",
       "      <td>菜量 实惠 ， 上菜 还 算 比较 快 ， 疙瘩汤 喝出 了 秋日 的 暖意 ， 烧茄子 吃...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>先说我算是娜娜家风荷园开业就一直在这里吃??每次出去回来总想吃一回??有时觉得外面的西式简餐...</td>\n",
       "      <td>1</td>\n",
       "      <td>先说 我 算是 娜娜 家风 荷园 开业 就 一直 在 这里 吃 ? ? 每次 出去 回来 总...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             comment  sentiment  \\\n",
       "0  口味：不知道是我口高了，还是这家真不怎么样。??我感觉口味确实很一般很一般。上菜相当快，我敢...          0   \n",
       "1                                 菜品丰富质量好，服务也不错！很喜欢！          1   \n",
       "2                        说真的，不晓得有人排队的理由，香精香精香精香精，拜拜！          0   \n",
       "3  菜量实惠，上菜还算比较快，疙瘩汤喝出了秋日的暖意，烧茄子吃出了大阪烧的味道，想吃土豆片也是口...          1   \n",
       "4  先说我算是娜娜家风荷园开业就一直在这里吃??每次出去回来总想吃一回??有时觉得外面的西式简餐...          1   \n",
       "\n",
       "                                         cut_comment  \n",
       "0  口味 ： 不 知道 是 我口 高 了 ， 还是 这家 真 不怎么样 。 ? ? 我 感觉 口...  \n",
       "1                      菜品 丰富 质量 好 ， 服务 也 不错 ！ 很 喜欢 ！  \n",
       "2         说真的 ， 不 晓得 有人 排队 的 理由 ， 香精 香精 香精 香精 ， 拜拜 ！  \n",
       "3  菜量 实惠 ， 上菜 还 算 比较 快 ， 疙瘩汤 喝出 了 秋日 的 暖意 ， 烧茄子 吃...  \n",
       "4  先说 我 算是 娜娜 家风 荷园 开业 就 一直 在 这里 吃 ? ? 每次 出去 回来 总...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 提取特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer  #TfidfVectorizer\n",
    "\n",
    "def get_custom_stopwords(stop_words_file):\n",
    "    with open(stop_words_file) as f:\n",
    "        stopwords = f.read()\n",
    "    stopwords_list = stopwords.split('\\n')\n",
    "    custom_stopwords_list = [i for i in stopwords_list]\n",
    "    return custom_stopwords_list\n",
    "\n",
    "stop_words_file = '哈工大停用词表.txt'\n",
    "stopwords = get_custom_stopwords(stop_words_file)\n",
    "\n",
    "vect = CountVectorizer(max_df = 0.8, \n",
    "                       min_df = 3, \n",
    "                       token_pattern=u'(?u)\\\\b[^\\\\d\\\\W]\\\\w+\\\\b', \n",
    "                       stop_words=frozenset(stopwords))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 划分数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "#划分数据集\n",
    "X = data['cut_comment']\n",
    "y = data.sentiment\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\software\\Anaconda\\lib\\site-packages\\sklearn\\feature_extraction\\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['exp', 'lex', '①①', '①②', '①③', '①④', '①⑤', '①⑥', '①⑦', '①⑧', '①⑨', '①ａ', '①ｂ', '①ｃ', '①ｄ', '①ｅ', '①ｆ', '①ｇ', '①ｈ', '①ｉ', '①ｏ', '②①', '②②', '②③', '②④', '②⑤', '②⑥', '②⑦', '②⑧', '②⑩', '②ａ', '②ｂ', '②ｄ', '②ｅ', '②ｆ', '②ｇ', '②ｈ', '②ｉ', '②ｊ', '③①', '③⑩', '③ａ', '③ｂ', '③ｃ', '③ｄ', '③ｅ', '③ｆ', '③ｇ', '③ｈ', '④ａ', '④ｂ', '④ｃ', '④ｄ', '④ｅ', '⑤ａ', '⑤ｂ', '⑤ｄ', '⑤ｅ', '⑤ｆ', 'ｌｉ', 'ｚｘｆｉｔｌ'] not in stop_words.\n",
      "  warnings.warn('Your stop_words may be inconsistent with '\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ipad</th>\n",
       "      <th>ok</th>\n",
       "      <th>一下</th>\n",
       "      <th>一个个</th>\n",
       "      <th>一个半</th>\n",
       "      <th>一个多</th>\n",
       "      <th>一人</th>\n",
       "      <th>一份</th>\n",
       "      <th>一会</th>\n",
       "      <th>一会儿</th>\n",
       "      <th>...</th>\n",
       "      <th>麻烦</th>\n",
       "      <th>麻辣</th>\n",
       "      <th>麻酱</th>\n",
       "      <th>黄瓜</th>\n",
       "      <th>黄盖</th>\n",
       "      <th>黄色</th>\n",
       "      <th>黏糊糊</th>\n",
       "      <th>黑椒</th>\n",
       "      <th>默默</th>\n",
       "      <th>齐全</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1871 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   ipad  ok  一下  一个个  一个半  一个多  一人  一份  一会  一会儿  ...  麻烦  麻辣  麻酱  黄瓜  黄盖  黄色  \\\n",
       "0     0   0   0    0    0    0   0   0   0    0  ...   0   0   0   0   0   0   \n",
       "1     0   0   0    0    0    0   0   0   0    0  ...   0   0   0   0   0   0   \n",
       "2     0   0   0    0    0    0   0   0   0    0  ...   0   0   0   0   0   0   \n",
       "3     0   0   0    0    0    0   0   0   0    0  ...   0   0   0   1   0   0   \n",
       "4     0   0   0    0    0    0   0   0   0    0  ...   0   0   0   0   0   0   \n",
       "\n",
       "   黏糊糊  黑椒  默默  齐全  \n",
       "0    0   0   0   0  \n",
       "1    0   0   0   0  \n",
       "2    0   0   0   0  \n",
       "3    0   0   0   0  \n",
       "4    0   0   0   0  \n",
       "\n",
       "[5 rows x 1871 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#特征展示\n",
    "test = pd.DataFrame(vect.fit_transform(X_train).toarray(), columns=vect.get_feature_names())\n",
    "test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.0\n"
     ]
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "knn = KNeighborsClassifier(n_neighbors=3) #,weights='distance'\n",
    "\n",
    "X_train_vect = vect.fit_transform(X_train)\n",
    "knn.fit(X_train_vect, y_train)\n",
    "train_score = knn.score(X_train_vect, y_train)\n",
    "print(train_score)  # train_score 模型在训练数据上的效果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 测试模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6984126984126984\n"
     ]
    }
   ],
   "source": [
    "X_test_vect = vect.transform(X_test)\n",
    "print(knn.score(X_test_vect, y_test)) # 模型在测试数据上的效果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 分析数据 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>菜品口味还行，上菜太慢，一杯鲜榨果汁居然四十分钟才上。薯条非常好吃，是我吃过最好吃的</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>铁板蛏子都是沙子，酱爆海兔超咸，海参捞饭里的海参硬硬的。评论都是骗人的……</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>重盐??油大??号称东北菜??东北菜什么时候这么油了？唯一的一个素菜??那个油啊??唉茄盒创...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>有点流于表面??菜相当一般??甚至有点难吃??别的都OK??但是饭馆最重要的还是菜啊??很坑</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>太差了,见过次的没见过这么次的。首先说稍显不次的，平心而论他的口味实在是太对不起它的价格了价...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             comment\n",
       "0         菜品口味还行，上菜太慢，一杯鲜榨果汁居然四十分钟才上。薯条非常好吃，是我吃过最好吃的\n",
       "1              铁板蛏子都是沙子，酱爆海兔超咸，海参捞饭里的海参硬硬的。评论都是骗人的……\n",
       "2  重盐??油大??号称东北菜??东北菜什么时候这么油了？唯一的一个素菜??那个油啊??唉茄盒创...\n",
       "3     有点流于表面??菜相当一般??甚至有点难吃??别的都OK??但是饭馆最重要的还是菜啊??很坑\n",
       "4  太差了,见过次的没见过这么次的。首先说稍显不次的，平心而论他的口味实在是太对不起它的价格了价..."
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_excel(\"data.xlsx\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_excel(\"data.xlsx\")\n",
    "data['cut_comment'] = data.comment.apply(chinese_word_cut)\n",
    "X=data['cut_comment']\n",
    "X_vec = vect.transform(X)\n",
    "knn_result = knn.predict(X_vec)\n",
    "#predict_proba(X)[source] 返回概率\n",
    "data['knn_sentiment'] = knn_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.to_excel(\"data_result.xlsx\",index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "218px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
